A Survey on text categorization of Indian and non-Indian languages using supervised learning techniques
Abstract
Text categorization plays an important role in the field of text mining. It is the process of assigning documents to predefined categories. Automatic text categorization has become an important task because of the large volume of electronic documents. This paper presents a survey of text categorization for Indian and non-Indian languages. Comparatively little work has been done on text categorization for Indian languages. The TF-IDF (term frequency-inverse document frequency) method is the one most often used to extract document features. The major classifiers used for text categorization are SVM (support vector machine), NB (Naïve Bayes), decision tree, and K-NN (K-nearest neighbor). The measures used to evaluate text categorization performance are recall, precision, and F-measure.
Keywords: Text Categorization, TF-IDF, SVM, NB.
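The feature extraction and evaluation steps named in the abstract can be illustrated with a minimal sketch. It is not taken from the surveyed papers: the function names `tf_idf` and `precision_recall_f1` are hypothetical, the weighting shown (length-normalized term frequency times log(N / df)) is only one common TF-IDF variant, and documents are assumed to be already tokenized.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other TF-IDF variants (smoothing, sublinear tf) are equally valid.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


def precision_recall_f1(y_true, y_pred, positive):
    """Compute precision, recall, and F-measure for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy data, invented for illustration only.
docs = [["cricket", "match", "cricket"], ["election", "match"]]
weights = tf_idf(docs)

y_true = ["sports", "sports", "politics", "sports"]
y_pred = ["sports", "politics", "sports", "sports"]
scores = precision_recall_f1(y_true, y_pred, positive="sports")
```

In this toy corpus the term "match" occurs in every document, so its weight is zero, which is the property that makes TF-IDF useful for discounting uninformative terms. In practice the resulting weight vectors would be passed to a classifier such as SVM, NB, decision tree, or K-NN.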
Similar resources
Named Entity Recognition for Code Mixing in Indian Languages using Hybrid Approach
Automating the process of Named Entity Recognition in social media text has received a lot of attention over the past few years. Named entities are real-world objects such as Person, Organization, Product, and Location. Identifying these entities in social media text is an important and challenging task due to the informal nature of the text present on social media. One such challenge that is faced in recognizing...
Cost Effective Dependency Parsing for Indian Languages
Indian languages are MoR-FWO and hence differ from English in structure and morphology. Indian languages possess many distinguishing characteristics. While working with these languages, we have to keep these characteristics in mind and plan strategies accordingly. We worked on improving dependency parsing for Indian languages, more specifically for Hindi, an Indo-Aryan language. ...
Different Techniques Implemented in Gurumukhi Word Sense Disambiguation
One of the most important issues in the field of Natural Language Engineering is Word Sense Disambiguation (WSD). Gurumukhi, more commonly known as Punjabi, is the world's 12th most widely spoken language, and it is morphologically rich. Surprisingly, relatively little effort has gone into the computerization of this language and the development of lexical resources for it. It is therefo...
A survey on text mining techniques
Text mining is a technique for finding meaningful patterns in available text documents. Pattern discovery from text and the organization of documents are well-known problems in data mining. Analyzing text content and categorizing documents is a complex data mining task. In order to find an efficient and effective technique for text categorization, various techniques ...
A study on efficiency and productivity of Indian non-life insurers using data envelopment analysis
This paper talks about the measurement of efficiency and productivity of non-life insurance firms in India. This study is focused on twelve private non-life insurance firms and four public sector non-life insurance firms of India in the period 2008-09 to 2012-13. Data Envelopment Analysis (DEA) coupled with Malmquist productivity Index is used in measuring the efficiency as well as productivity...